智能论文笔记

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

Bonaventure F. P. Dossou , Atnafu Lambebo Tonja , Oreen Yousuf , Salomey Osei , Abigail Oppong , Iyanuoluwa Shode , Oluwabusayo Olufunke Awoyomi , Chris Chinenye Emezue

分类：自然语言处理 | 人工智能 | 机器学习

2022-11-07

In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language models pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that \textbf{AfroLM} is able to generalize well across various domains. We release the code source, and our datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.

translated by 谷歌翻译

Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.

translated by 谷歌翻译

语言模型预训练的最新进展利用大规模数据集创建多语言模型。但是，这些数据集中大多遗漏了低资源语言。这主要是因为网络上没有很好地表示口语，因此被排除在用于创建数据集的大规模爬网中。此外，这些模型的下游用户仅限于最初选择用于预训练的语言的选择。这项工作调查了如何最佳利用现有的预培训模型来为16种非洲语言创建低资源翻译系统。我们关注两个问题：1）如何将预训练的模型用于初始预培训中未包含的语言？ 2）生成的翻译模型如何有效地转移到新域？为了回答这些问题，我们创建了一个新的非洲新闻语料库，涵盖16种语言，其中8种语言不属于任何现有评估数据集的一部分。我们证明，将两种语言转移到其他语言和其他领域的最有效策略是，以少量的高质量翻译数据微调大型预训练模型。

translated by 谷歌翻译